Abstract

The purpose of statistical disclosure control (SDC) of microdata, a.k.a. 
data anonymization or privacy-preserving data mining, is to publish data 
sets containing the answers of individual respondents in such a way that 
the respondents corresponding to the released records cannot be re-identified
and the released data are analytically useful. SDC methods are either 
based on masking the original data, generating synthetic versions of them 
or creating hybrid versions by combining original and synthetic data. The 
choice of SDC methods for categorical data, especially nominal data, is 
much smaller than the choice of methods for numerical data. We mitigate 
this problem by introducing a numerical mapping for hierarchical nominal 
data which allows computing means, variances and covariances on them. 
Keywords: Statistical disclosure control; Data anonymization; Privacy- 
preserving data mining; Variance of hierarchical data; Hierarchical nomi- 
nal data 



1 Introduction 

Statistical disclosure control (SDC), a.k.a. data anonymization and sometimes
known as privacy-preserving data mining, aims at making possible the
publication of statistical data in such a way that the individual responses of
specific users cannot be inferred from the published data and background knowl- 
edge available to intruders. If the data set being published consists of records 
corresponding to individuals, usual SDC methods operate by masking original
data (via perturbation or detail reduction), by generating synthetic (simulated) 
data preserving some statistical features of the original data or by producing 
hybrid data obtained as a combination of original and synthetic data. Whatever 
the protection method chosen, the resulting data should still preserve enough 
analytical validity for their publication to be useful to potential users. 

A microdata set can be defined as a file with a number of records, where 
each record contains a number of attributes on an individual respondent. At- 
tributes can be classified depending on their range and the operations that can 
be performed on them: 

1. Numerical. An attribute is considered numerical if arithmetical operations 
can be performed on it. Examples are income and age. When designing 
methods to protect numerical data, one has the advantage that arithmeti- 
cal operations are possible, and the drawback that every combination of 
numerical values in the original data set is likely to be unique, which leads 
to disclosure if no action is taken. 

2. Categorical. An attribute is considered categorical when it takes values 
over a finite set and standard arithmetical operations on it do not make 
sense. Two main types of categorical attributes can be distinguished: 

(a) Ordinal. An ordinal attribute takes values in an ordered range of 
categories. Thus, the <, max and min operators are meaningful and 
can be used by SDC techniques for ordinal data. The instruction 
level and the political preferences (left-right) are examples of ordinal 
attributes. 

(b) Nominal. A nominal attribute takes values in an unordered range 
of categories. The only possible operator is comparison for equality. 
Nominal attributes can further be divided into two types: 

i. Hierarchical. A hierarchical nominal attribute takes values from 
a hierarchical classification. For example, plants are classified 
using Linnaeus's taxonomy, the type of a disease is also selected 
from a hierarchical taxonomy, and the type of an attribute can 
be selected from the hierarchical classification we propose in this 
section. 

ii. Non-hierarchical. A non-hierarchical nominal attribute takes val- 
ues from a flat hierarchy. Examples of such attributes could be 
the preferred soccer team, the address of an individual, the civil 
status (married, single, divorced, widow/er), the eye color, etc.

This paper focuses on finding a numerical mapping of nominal attributes, 
and more precisely hierarchical nominal attributes. In addition to other con- 
ceivable applications not dealt with in this paper, such a mapping can be used 
to anonymize nominal data in ways so far reserved to numerical data. The
interest of this is that many more SDC methods exist for anonymizing numerical
data than categorical and especially nominal data.

Assuming a hierarchy is less restrictive than it would appear, because very 
often a non-hierarchical attribute can be turned into a hierarchical one if its 
flat hierarchy can be developed into a multilevel hierarchy. For instance, the 
preferred soccer team and the address of an individual have been mentioned as
non-hierarchical attributes; however, a hierarchy of soccer teams by continent and
country could be conceived, and addresses can be hierarchically clustered by
neighborhood, city, state, country, etc. Furthermore, well-known approaches
to anonymization, like k-anonymity [7], assume that any attribute can be
generalized, i.e. that an attribute hierarchy can be defined and values at lower 
levels of the hierarchy can be replaced by values at higher levels. 

1.1 Contribution and plan of this paper 

We propose to associate a number to each categorical value of a hierarchical 
nominal attribute, namely a form of centrality of that category within the at- 
tribute's hierarchy. We show how this allows computation of centroids, variances 
and covariances of hierarchical nominal data. 

Section 2 gives background on the variance of hierarchical nominal attributes.
Section 3 defines a tree centrality measure called marginality and presents the
numerical mapping. Section 4 exploits the numerical mapping to compute
means, variances and covariances of hierarchical nominal data. Conclusions
are drawn in Section 5.

2 Background 

We next recall the variance measure for hierarchical nominal attributes
introduced in [2]. To the best of our knowledge, this is the first measure which
captures the variability of a sample of values of a hierarchical nominal attribute
by taking into account the semantics of the hierarchy. The intuitive idea is that
a set of nominal values belonging to categories which are all children of the same
parent category in the hierarchy has smaller variance than a set with children
from different parent categories.

Algorithm 1 (Nominal variance in [2]) 

1. Let the hierarchy of categories of a nominal attribute X be such that b is 
the maximum number of children that a parent category can have in the 
hierarchy. 

2. Given a sample $T_X$ of nominal categories drawn from X, place them in the
tree representing the hierarchy of X. Prune the subtrees whose nodes have
no associated sample values. If there are repeated sample values, there will
be several nominal values associated to one or more nodes (categories) in
the pruned tree.

3. Label as follows the edges remaining in the tree from the root node to each
of its children:

• If b is odd, consider the following succession of labels $l_0 = (b-1)/2$,
$l_1 = (b-1)/2 - 1$, $l_2 = (b-1)/2 + 1$, $l_3 = (b-1)/2 - 2$, $l_4 = (b-1)/2 + 2$,
$\cdots$, $l_{b-2} = 0$, $l_{b-1} = b-1$.

• If b is even, consider the following succession of labels $l_0 = (b-2)/2$,
$l_1 = (b-2)/2 + 1$, $l_2 = (b-2)/2 - 1$, $l_3 = (b-2)/2 + 2$, $l_4 = (b-2)/2 - 2$,
$\cdots$, $l_{b-2} = 0$, $l_{b-1} = b-1$.

• Label the edge leading to the child with most categories associated
to its descendant subtree as $l_0$, the edge leading to the child with
the second highest number of categories associated to its descendant
subtree as $l_1$, the one leading to the child with the third highest number
of categories associated to its descendant subtree as $l_2$ and, in general,
the edge leading to the child with the i-th highest number of categories
associated to its descendant subtree as $l_{i-1}$. Since there are at most b
children, the set of labels $\{l_0, \cdots, l_{b-1}\}$ should suffice. Thus an edge
label can be viewed as a b-ary digit (to the base b).

4. Recursively repeat Step 3 taking instead of the root node each of the root's
child nodes.

5. Assign to values associated to each node in the hierarchy a node label con- 
sisting of a b-ary number constructed from the edge labels, more specifically 
as the concatenation of the b-ary digits labeling the edges along the path 
from the root to the node: the label of the edge starting from the root is 
the most significant one and the edge label closest to the specific node is 
the least significant one. 

6. Let L be the maximal length of the leaf b-ary labels. Append as many $l_0$
digits as needed in the least significant positions to the shorter labels so
that all of them eventually consist of L digits.

7. Let $T_X(0)$ be the set of b-ary digits in the least significant positions of the
node labels (the "units" positions); let $T_X(1)$ be the set of b-ary digits in the
second least significant positions of the node labels (the "tens" positions),
and so on, until $T_X(L-1)$ which is the set of digits in the most significant
positions of the node labels.

8. Compute the variance of the sample as

$$\mathrm{Var}_H(T_X) = \mathrm{Var}(T_X(0)) + b^2 \cdot \mathrm{Var}(T_X(1)) + \cdots + b^{2(L-1)} \cdot \mathrm{Var}(T_X(L-1)) \qquad (1)$$

where $\mathrm{Var}(\cdot)$ is the usual numerical variance.
In Section 4.2 below we will show that an equivalent measure can be obtained
in a simpler and more manageable way.
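Steps 3 and 8 of Algorithm 1 can be illustrated with a short Python sketch. It generates the label succession for a given b and evaluates Expression (1) from node labels that have already been padded to L digits. The function names, the representation of labels as tuples of digits (most significant first) and the use of the population variance for Var(·) are assumptions made for this example; the text only speaks of "the usual numerical variance".

```python
from statistics import pvariance

def label_succession(b):
    """Return [l_0, ..., l_{b-1}]: labels alternate around a central digit."""
    if b % 2 == 1:
        center, first_sign = (b - 1) // 2, -1   # odd b: first step goes down
    else:
        center, first_sign = (b - 2) // 2, +1   # even b: first step goes up
    labels, offset, sign = [center], 1, first_sign
    while len(labels) < b:
        labels.append(center + sign * offset)
        if sign != first_sign:      # both sides at this offset have been used
            offset += 1
        sign = -sign
    return labels

def var_h(node_labels, b):
    """Expression (1) for padded labels given as digit tuples (assumed format)."""
    L = len(node_labels[0])
    total = 0.0
    for i in range(L):                              # i = 0 is the "units" position
        digits = [label[L - 1 - i] for label in node_labels]
        total += b ** (2 * i) * pvariance(digits)   # population variance (assumption)
    return total
```

With b = 3, two values whose labels differ only in the least significant digit, such as (1, 0) and (1, 2), get a smaller variance than two values whose labels differ in the most significant digit, such as (0, 1) and (2, 1), which matches the intuition stated above.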



3 A numerical mapping for nominal hierarchical 
data 

Consider a nominal attribute X taking values from a hierarchical classification.
Let $T_X$ be a sample of values of X. Each value $x \in T_X$ can be associated with two
numerical values:

• The sample frequency of x; 

• Some centrality measure of x within the hierarchy of X. 

While the frequency depends on the particular sample, centrality measures
depend both on the attribute hierarchy and the sample. Known tree centralities
attempt to determine the "middle" of a tree. We are rather interested in
finding how far from the middle each node of the tree is, that is, how marginal it
is. We next propose an algorithm to compute a new measure of the marginality
of the values in the sample $T_X$.

Algorithm 2 (Marginality of nominal values) 

1. Given a sample $T_X$ of nominal categorical values drawn from X, place
them in the tree representing the hierarchy of X. There is a one-to-one
mapping between the set of tree nodes and the set of categories where X
takes values. Prune the subtrees whose nodes have no associated sample
values. If there are repeated sample values, there will be several nominal
values associated to one or more nodes (categories) in the pruned tree.

2. Let L be the depth of the pruned tree. Associate weight $2^{L-1}$ to edges
linking the root of the hierarchy to its immediate descendants (depth 1),
weight $2^{L-2}$ to edges linking the depth 1 descendants to their own
descendants (depth 2), and so on, up to weight $2^0 = 1$ to the edges linking
descendants at depth $L-1$ with those at depth L. In general, weight $2^{L-i}$
is assigned to edges linking nodes at depth $i-1$ with those at depth i, for
$i = 1$ to L.

3. For each nominal value $x_j$ in the sample, its marginality $m(x_j)$ is defined
and computed as

$$m(x_j) = \sum_{x_l \in T_X - \{x_j\}} d(x_j, x_l)$$

where $d(x_j, x_l)$ is the sum of the edge weights along the shortest path from
the tree node corresponding to $x_j$ to the tree node corresponding to $x_l$.
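The steps of Algorithm 2 can be sketched in Python as follows. The hierarchy is represented as a child-to-parent dictionary and the sample as a list of category names; both representations, the function names and the toy hierarchy are assumptions made for illustration. The sketch also takes the depth L of the pruned tree to be the depth of the deepest sampled category, which is one reading of Step 1.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root (the root has parent None)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def distance(u, v, parent, L):
    """Sum of edge weights on the tree path between u and v."""
    path_u, path_v = path_to_root(u, parent), path_to_root(v, parent)
    on_path_u = set(path_u)
    lca = next(n for n in path_v if n in on_path_u)   # lowest common ancestor
    total = 0
    for path in (path_u, path_v):
        for node in path[:path.index(lca)]:           # nodes strictly below the LCA
            depth = len(path_to_root(node, parent)) - 1
            total += 2 ** (L - depth)                 # edge from depth-1 to depth
    return total

def marginalities(sample, parent):
    """m(x_j) for every sampled value, in sample order."""
    # Assumption: the pruned tree is as deep as the deepest sampled category.
    L = max(len(path_to_root(x, parent)) - 1 for x in sample)
    return [
        sum(distance(xj, xl, parent, L) for l, xl in enumerate(sample) if l != j)
        for j, xj in enumerate(sample)
    ]

# Hypothetical two-level hierarchy: root -> {A, B}, A -> {a1, a2}, B -> {b1}.
parent = {"root": None, "A": "root", "B": "root",
          "a1": "A", "a2": "A", "b1": "B"}
sample = ["a1", "a2", "b1"]
m = marginalities(sample, parent)
```

In this toy hierarchy the edges at depth 1 weigh 2 and those at depth 2 weigh 1, so the two siblings a1 and a2 are at distance 2 from each other and at distance 6 from b1. The value b1, alone under its parent, ends up with the largest marginality.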
Clearly, the greater $m(x_j)$, the more marginal (i.e. the less central) is $x_j$.
Some properties follow which illustrate the rationale of the distance and the
weights used to compute the marginality.

Lemma 1 $d(\cdot,\cdot)$ is a distance in the mathematical sense.

Being the length of a path, it is immediate to check that $d(\cdot,\cdot)$ satisfies
reflexivity, symmetry and subadditivity. The rationale of the above exponential
weight scheme is to give more weight to differences at higher levels of the
hierarchy; specifically, the following property is satisfied.

Lemma 2 The distance between any non-root node $n_j$ and its immediate
ancestor is greater than the distance between $n_j$ and any of its descendants.

Proof: Let L be the depth of the overall tree and $L_j$ be the depth of $n_j$.
The distance between $n_j$ and its immediate ancestor is $2^{L-L_j}$. The distance
between $n_j$ and its most distant descendant is

$$1 + 2 + \cdots + 2^{L-L_j-1} = 2^{L-L_j} - 1$$

□

Lemma 3 The distance between any two nodes at the same depth is greater 
than the longest distance within the subtree rooted at each node. 

Proof: Let L be the depth of the overall tree and $L_j$ be the depth of the
two nodes. The shortest distance between both nodes occurs when they have
the same parent and it is

$$2 \cdot 2^{L-L_j} = 2^{L-L_j+1}$$

The longest distance within any of the two subtrees rooted at the two nodes at
depth $L_j$ is the length of the path between two leaves at depth L, which is

$$2 \cdot (1 + 2 + \cdots + 2^{L-L_j-1}) = 2(2^{L-L_j} - 1) = 2^{L-L_j+1} - 2$$

□
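The arithmetic behind Lemmas 2 and 3 can be checked mechanically for small trees. The Python sketch below enumerates tree depths L and node depths L_j and compares the relevant sums of edge weights; the function names and the bound on the depth are arbitrary choices made for this example.

```python
def edge_weight(L, depth):
    """Weight of the edge joining depth-1 and depth in a tree of depth L."""
    return 2 ** (L - depth)

def check_lemmas(max_depth=12):
    for L in range(1, max_depth + 1):
        for Lj in range(1, L + 1):
            to_parent = edge_weight(L, Lj)
            # Path from a node at depth Lj down to a leaf at depth L.
            to_deepest_descendant = sum(
                edge_weight(L, i) for i in range(Lj + 1, L + 1)
            )
            assert to_deepest_descendant == 2 ** (L - Lj) - 1
            assert to_parent > to_deepest_descendant        # Lemma 2
            siblings = 2 * to_parent                        # path via the common parent
            longest_inside = 2 * to_deepest_descendant      # leaf to leaf through the node
            assert siblings == longest_inside + 2           # Lemma 3
    return True
```

Calling `check_lemmas()` returns `True` when every combination satisfies both inequalities, which is what the closed-form sums in the proofs predict.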

4 Statistical analysis of numerically mapped nom- 
inal data 

In the previous section we have shown how a nominal value $x_j$ can be associated
with a marginality measure $m(x_j)$. In this section, we show how this numerical
magnitude can be used in statistical analysis.

4.1 Mean



The mean of a sample of nominal values cannot be computed in the standard 
sense. However, it can be reasonably approximated by the least marginal value, 
that is, by the most central value in terms of the hierarchy. 

Definition 1 (Marginality-based approximated mean) Given a sample $T_X$
of a hierarchical nominal attribute X, the marginality-based approximated mean
is defined as

$$\mathrm{Mean}_M(T_X) = \arg\min_{x_j \in T_X} m(x_j)$$

if one wants the mean to be a nominal value, or

$$\mathrm{Num\_mean}_M(T_X) = \min_{x_j \in T_X} m(x_j)$$

if one wants a numerical mean value.
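Both variants of Definition 1 reduce to a minimum over the marginalities. A minimal Python sketch, assuming the marginalities have already been computed and are stored in a list aligned with the sample (the names and numbers below are hypothetical):

```python
def mean_m(sample, marginality):
    """Nominal approximated mean: a value with the smallest marginality."""
    j = min(range(len(sample)), key=lambda k: marginality[k])
    return sample[j]

def num_mean_m(marginality):
    """Numerical approximated mean: the smallest marginality itself."""
    return min(marginality)

sample = ["a1", "a2", "b1"]      # hypothetical categories
marginality = [8, 8, 12]         # hypothetical m(x_j) values
```

When several values share the smallest marginality, as a1 and a2 do here, this sketch returns the first one; Definition 1 does not say how ties should be broken.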



4.2 Variance 

In Section 2 above, we recalled a measure of variance of a hierarchical nominal
attribute proposed in [2] which takes the semantics of the hierarchy into
account. Interestingly, it turns out that the average marginality of a sample is an
equivalent way to capture the same notion of variance.

Definition 2 (Marginality-based variance) Given a sample $T_X$ of n values
drawn from a hierarchical nominal attribute X, the marginality-based sample
variance is defined as

$$\mathrm{Var}_M(T_X) = \frac{\sum_{x_j \in T_X} m(x_j)}{n}$$
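Read as the average marginality of the sample, the marginality-based variance can be computed directly from pairwise distances. The Python sketch below assumes a symmetric matrix of distances between the sampled values, already computed with the edge weights of Algorithm 2; the function name and the two toy matrices are hypothetical.

```python
def var_m(dist):
    """Average marginality, from a symmetric pairwise distance matrix."""
    n = len(dist)
    marginality = [
        sum(row[l] for l in range(n) if l != j) for j, row in enumerate(dist)
    ]
    return sum(marginality) / n

# Two leaves of a depth-2 hierarchy (edge weights 2 at depth 1, 1 at depth 2):
siblings = [[0, 2], [2, 0]]   # same parent: distance 1 + 1
cousins = [[0, 6], [6, 0]]    # different depth-1 parents: distance 1 + 2 + 2 + 1
```

The pair of siblings gets a smaller variance than the pair under different parents, which is the behaviour the variance of Section 2 was designed to capture.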

The following lemma is proven in the Appendix. 

Lemma 4 The $\mathrm{Var}_M(\cdot)$ measure and the $\mathrm{Var}_H(\cdot)$ measure specified by Algorithm 1 in
Section 2 are equivalent.



4.3 Covariance matrix 

It is not difficult to generalize the sample variance introduced in Definition 2 to
define the sample covariance of two nominal attributes.

Definition 3 (Marginality-based covariance) Given a bivariate sample $T_{(X,Y)}$
consisting of n ordered pairs of values $\{(x_1, y_1), \cdots, (x_n, y_n)\}$ drawn from the
ordered pair of nominal attributes $(X, Y)$, the marginality-based sample
covariance is defined as

$$\mathrm{Covar}_M(T_{(X,Y)}) = \frac{\sum_{j=1}^{n} \sqrt{m(x_j)\, m(y_j)}}{n}$$
The above definition yields a non-negative covariance whose value is higher
when the marginalities of the values taken by X and Y are positively correlated:
as the values taken by X become more marginal, so become the values taken
by Y.


Given a multivariate data set T containing a sample of d nominal attributes
$X^1, \cdots, X^d$, using Definitions 2 and 3 yields a covariance matrix $S = \{s_{jl}\}$, for
$1 \leq j \leq d$ and $1 \leq l \leq d$, where $s_{jj} = \mathrm{Var}_M(T_j)$, $s_{jl} = \mathrm{Covar}_M(T_{jl})$ for $j \neq l$,
$T_j$ is the column of values taken by $X^j$ in T and $T_{jl} = (T_j, T_l)$.

We can use the following distance definition for records with numerical, 
nominal or hierarchical attributes. 

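Definition 4 (SSE-distance) The SSE-distance between two records $\mathbf{x}_1$ and
$\mathbf{x}_2$ in a data set with d attributes is

$$\delta(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\frac{(S^2)^1_{12}}{(S^2)^1} + \cdots + \frac{(S^2)^d_{12}}{(S^2)^d}} \qquad (2)$$

where $(S^2)^l_{12}$ is the variance of the l-th attribute over the group formed by $\mathbf{x}_1$
and $\mathbf{x}_2$, and $(S^2)^l$ is the variance of the l-th attribute over the entire data set.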
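For records made only of numerical attributes, the SSE-distance can be sketched in a few lines of Python. The sketch assumes the sample variance (denominator n − 1) as "the usual numerical variance" and assumes that no attribute is constant over the data set, since a zero data set variance would make the ratio undefined. The function name and the toy data set are hypothetical; for nominal attributes the two variances would instead be marginality-based variances.

```python
import math
from statistics import variance

def sse_distance(r1, r2, dataset):
    """SSE-distance between two numerical records of the same length."""
    total = 0.0
    for l in range(len(r1)):
        group_var = variance([r1[l], r2[l]])             # over the pair {r1, r2}
        data_var = variance([rec[l] for rec in dataset]) # over the whole data set
        total += group_var / data_var                    # assumes data_var > 0
    return math.sqrt(total)

# Hypothetical data set with d = 2 numerical attributes.
dataset = [(1.0, 10.0), (3.0, 30.0), (5.0, 20.0)]
d12 = sse_distance(dataset[0], dataset[1], dataset)
d23 = sse_distance(dataset[1], dataset[2], dataset)
d13 = sse_distance(dataset[0], dataset[2], dataset)
```

Dividing by the variance over the entire data set makes each attribute's contribution scale-free, so attributes measured in large units do not dominate the distance.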

We prove in the Appendix the following two theorems stating that the dis- 
tance above satisfies the properties of a mathematical distance. 

Theorem 1 The SSE-distance on multivariate records consisting of nominal
attributes based on the nominal variance computed as per Definition 2 is a
distance in the mathematical sense.

Theorem 2 The SSE-distance on multivariate records consisting of ordinal or 
numerical attributes based on the usual numerical variance is a distance in the 
mathematical sense. 

By combining the proofs of Theorems 1 and 2 the next corollary follows.

Corollary 1 The SSE-distance on multivariate records consisting of attributes 
of any type, where the nominal variance is used for nominal attributes and 
the usual numerical variance is used for ordinal and numerical attributes, is a 
distance in the mathematical sense. 

5 Conclusions 

We have presented a centrality-based mapping of hierarchical nominal data to 
numbers. We have shown how such a numerical mapping allows computing 
means, variances and covariances of nominal attributes, and distances between 
records containing any kind of attributes. Such enhanced flexibility of
manipulation of nominal attributes can be used, e.g. to adapt anonymization methods
intended for numerical data to the treatment of nominal and hierarchical
attributes. The only requirement is that, whatever the treatment, it should not
modify the numerical values assigned by marginality, in order for the numerical
mapping to be reversible and allow recovering the original nominal values after
treatment.

Appendix 

Proof (Lemma 4): We will show that, given two samples $T_X = \{x_1, \cdots, x_n\}$
and $T'_X = \{x'_1, \cdots, x'_n\}$ of a nominal attribute X, both with the same cardinality
n, it holds that $\mathrm{Var}_M(T_X) < \mathrm{Var}_M(T'_X)$ if and only if $\mathrm{Var}_H(T_X) < \mathrm{Var}_H(T'_X)$.

Assume that $\mathrm{Var}_M(T_X) < \mathrm{Var}_M(T'_X)$. Since both samples have the same
cardinality, this is equivalent to

$$\sum_{j=1}^{n} m(x_j) < \sum_{j=1}^{n} m(x'_j)$$

By developing the marginalities, we obtain

$$\sum_{j=1}^{n} \sum_{x_l \in T_X - \{x_j\}} d(x_j, x_l) < \sum_{j=1}^{n} \sum_{x'_l \in T'_X - \{x'_j\}} d(x'_j, x'_l)$$

Since distances are sums of powers of 2, from 1 to $2^{L-1}$, we can write the above
inequality as

$$d_0 + 2 d_1 + \cdots + 2^{L-1} d_{L-1} < d'_0 + 2 d'_1 + \cdots + 2^{L-1} d'_{L-1} \qquad (3)$$

By viewing $d_{L-1} \cdots d_1 d_0$ and $d'_{L-1} \cdots d'_1 d'_0$ as binary numbers, it is easy to see
that Inequality (3) implies that some i must exist such that $d_i < d'_i$ and $d_{i'} \leq d'_{i'}$
for $i < i' \leq L-1$. This implies that there are fewer high-level edge differences
associated to the values of $T_X$ than to the values of $T'_X$. Hence, in terms of
$\mathrm{Var}_H(\cdot)$, we have that $\mathrm{Var}(T_X(i)) < \mathrm{Var}(T'_X(i))$ and $\mathrm{Var}(T_X(i')) \leq \mathrm{Var}(T'_X(i'))$
for $i < i' \leq L-1$. This yields $\mathrm{Var}_H(T_X) < \mathrm{Var}_H(T'_X)$.

If we now assume $\mathrm{Var}_H(T_X) < \mathrm{Var}_H(T'_X)$ we can prove $\mathrm{Var}_M(T_X) <
\mathrm{Var}_M(T'_X)$ by reversing the above argument. □

Lemma 5 Given non-negative $A, A', A'', B, B', B''$ such that $\sqrt{A} \leq \sqrt{A'} + \sqrt{A''}$
and $\sqrt{B} \leq \sqrt{B'} + \sqrt{B''}$ it holds that

$$\sqrt{A + B} \leq \sqrt{A' + B'} + \sqrt{A'' + B''} \qquad (4)$$

Proof (Lemma 5): Squaring the two inequalities in the lemma assumption,
we obtain

$$A \leq (\sqrt{A'} + \sqrt{A''})^2$$
$$B \leq (\sqrt{B'} + \sqrt{B''})^2$$

Adding both expressions above, we get the square of the left-hand side of
Expression (4)

$$A + B \leq (\sqrt{A'} + \sqrt{A''})^2 + (\sqrt{B'} + \sqrt{B''})^2
= A' + A'' + B' + B'' + 2(\sqrt{A'A''} + \sqrt{B'B''}) \qquad (5)$$

Squaring the right-hand side of Expression (4), we get

$$(\sqrt{A' + B'} + \sqrt{A'' + B''})^2
= A' + B' + A'' + B'' + 2\sqrt{(A' + B')(A'' + B'')} \qquad (6)$$

Since Expressions (5) and (6) both contain the terms $A' + B' + A'' + B''$, we
can neglect them. Proving Inequality (4) thus reduces to proving

$$\sqrt{A'A''} + \sqrt{B'B''} \leq \sqrt{(A' + B')(A'' + B'')}$$

Suppose the opposite, that is,

$$\sqrt{A'A''} + \sqrt{B'B''} > \sqrt{(A' + B')(A'' + B'')} \qquad (7)$$

Square both sides:

$$A'A'' + B'B'' + 2\sqrt{A'A''B'B''} > (A' + B')(A'' + B'') = A'A'' + B'B'' + A'B'' + B'A''$$

Subtract $A'A'' + B'B''$ from both sides to obtain

$$2\sqrt{A'A''B'B''} > A'B'' + B'A''$$

which can be rewritten as

$$(\sqrt{A'B''} - \sqrt{B'A''})^2 < 0$$

Since a real square cannot be negative, the assumption in Expression (7) is false
and the lemma follows. □
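Lemma 5 can also be probed numerically. The Python sketch below draws random non-negative values that satisfy the two hypotheses of the lemma and checks Inequality (4) on each draw, allowing a small tolerance for floating-point rounding. The sampling scheme, the tolerance and the names (with A1, A2 standing for A', A'') are assumptions of this example; passing the check illustrates the lemma but does not replace the proof.

```python
import math
import random

def lemma5_holds(A, A1, A2, B, B1, B2, tol=1e-9):
    """Check sqrt(A+B) <= sqrt(A'+B') + sqrt(A''+B'') up to rounding."""
    return math.sqrt(A + B) <= math.sqrt(A1 + B1) + math.sqrt(A2 + B2) + tol

def random_instance(rng):
    """Draw non-negative values satisfying both hypotheses of the lemma."""
    A1, A2, B1, B2 = (rng.uniform(0, 10) for _ in range(4))
    # Scaling by a factor in [0, 1] keeps sqrt(A) <= sqrt(A') + sqrt(A'').
    A = rng.uniform(0, 1) * (math.sqrt(A1) + math.sqrt(A2)) ** 2
    B = rng.uniform(0, 1) * (math.sqrt(B1) + math.sqrt(B2)) ** 2
    return A, A1, A2, B, B1, B2

rng = random.Random(0)   # fixed seed so the run is reproducible
all_ok = all(lemma5_holds(*random_instance(rng)) for _ in range(10_000))
```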

Proof (Theorem 1): We must prove that the SSE-distance is non-negative,
reflexive, symmetrical and subadditive (i.e. it satisfies the triangle inequality).

Non-negativity. The SSE-distance is defined as a non-negative square root, 
hence it cannot be negative. 

Reflexivity. If $\mathbf{x}_1 = \mathbf{x}_2$, then $\delta(\mathbf{x}_1, \mathbf{x}_2) = 0$. Conversely, if $\delta(\mathbf{x}_1, \mathbf{x}_2) = 0$, the
variances are all zero, hence $\mathbf{x}_1 = \mathbf{x}_2$.

Symmetry. It follows from the definition of the SSE-distance. 

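Subadditivity. Given three records $\mathbf{x}_1$, $\mathbf{x}_2$ and $\mathbf{x}_3$, we must check whether

$$\delta(\mathbf{x}_1, \mathbf{x}_3) \leq \delta(\mathbf{x}_1, \mathbf{x}_2) + \delta(\mathbf{x}_2, \mathbf{x}_3)$$

By expanding the above expression using Expression (2), we obtain

$$\sqrt{\frac{(S^2)^1_{13}}{(S^2)^1} + \cdots + \frac{(S^2)^d_{13}}{(S^2)^d}} \leq \sqrt{\frac{(S^2)^1_{12}}{(S^2)^1} + \cdots + \frac{(S^2)^d_{12}}{(S^2)^d}} + \sqrt{\frac{(S^2)^1_{23}}{(S^2)^1} + \cdots + \frac{(S^2)^d_{23}}{(S^2)^d}} \qquad (8)$$

Let us start with the case d = 1, that is, with a single attribute, i.e. $\mathbf{x}_i = x_i$
for i = 1, 2, 3. To check Inequality (8) with d = 1, we can ignore the variance
in the denominators (it is the same on both sides) and we just need to check

$$S_{13} \leq S_{12} + S_{23} \qquad (9)$$

We have

$$S^2_{13} = \mathrm{Var}_M(\{x_1, x_3\}) = \frac{m(x_1) + m(x_3)}{2} = \frac{d(x_1, x_3)}{2} + \frac{d(x_3, x_1)}{2} = d(x_1, x_3) \qquad (10)$$

Similarly $S^2_{12} = d(x_1, x_2)$ and $S^2_{23} = d(x_2, x_3)$. Therefore, Expression (9) is
equivalent to subadditivity for $d(\cdot,\cdot)$ and the latter holds by Lemma 1. Let us
now make the induction hypothesis for d − 1 and prove subadditivity for any d.
Call now

$$A := \frac{(S^2)^1_{13}}{(S^2)^1} + \cdots + \frac{(S^2)^{d-1}_{13}}{(S^2)^{d-1}}; \quad B := \frac{(S^2)^d_{13}}{(S^2)^d}$$

$$A' := \frac{(S^2)^1_{12}}{(S^2)^1} + \cdots + \frac{(S^2)^{d-1}_{12}}{(S^2)^{d-1}}; \quad B' := \frac{(S^2)^d_{12}}{(S^2)^d}$$

$$A'' := \frac{(S^2)^1_{23}}{(S^2)^1} + \cdots + \frac{(S^2)^{d-1}_{23}}{(S^2)^{d-1}}; \quad B'' := \frac{(S^2)^d_{23}}{(S^2)^d}$$

Subadditivity for d amounts to checking whether

$$\sqrt{A + B} \leq \sqrt{A' + B'} + \sqrt{A'' + B''} \qquad (11)$$

which holds by Lemma 5 because, by the induction hypothesis for d − 1, we have
$\sqrt{A} \leq \sqrt{A'} + \sqrt{A''}$ and, by the proof for d = 1, we have $\sqrt{B} \leq \sqrt{B'} + \sqrt{B''}$. □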



Proof (Theorem 2): Non-negativity, reflexivity and symmetry are proven in
a way analogous to Theorem 1. As to subadditivity, we just need to prove the
case d = 1, that is, the inequality analogous to Expression (9) for numerical
variances. The proof for general d is the same as in Theorem 1. For d = 1, we
have

$$S^2_{13} = \frac{(x_1 - x_3)^2}{2}; \quad S^2_{12} = \frac{(x_1 - x_2)^2}{2}; \quad S^2_{23} = \frac{(x_2 - x_3)^2}{2}$$

Therefore, Expression (9) obviously holds with equality in the case of numerical
variances because

$$S_{13} = \frac{x_1 - x_3}{\sqrt{2}} = \frac{(x_1 - x_2) + (x_2 - x_3)}{\sqrt{2}} = S_{12} + S_{23}$$

□
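The d = 1 case for numerical attributes can be checked numerically. The Python sketch below assumes the sample variance (denominator n − 1), under which the standard deviation of a pair is $|x_i - x_j|/\sqrt{2}$; the three values are hypothetical. Equality in Expression (9) shows up when $x_2$ lies between $x_1$ and $x_3$, while for other orderings the inequality is strict, so subadditivity holds in both situations.

```python
import math
from statistics import variance

def s(xi, xj):
    """Square root of the sample variance of the pair {xi, xj}."""
    return math.sqrt(variance([xi, xj]))

x1, x2, x3 = 9.0, 5.0, 2.0    # hypothetical values with x1 >= x2 >= x3

# Middle value used as the intermediate record: equality up to rounding.
ordered_gap = s(x1, x2) + s(x2, x3) - s(x1, x3)

# Extreme value used as the intermediate record: strict inequality.
unordered_gap = s(x1, x3) + s(x3, x2) - s(x1, x2)
```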